Environmental Sound Classification with Parallel Temporal-spectral Attention
Convolutional neural networks (CNNs) are among the best-performing neural
network architectures for environmental sound classification (ESC). Recently,
temporal attention mechanisms have been used in CNNs to capture the useful
information from the relevant time frames for audio classification, especially
for weakly labelled data where the onset and offset times of the sound events
are not provided. In these methods, however, the inherent spectral
characteristics and variations are not explicitly exploited when obtaining the
deep features. In this paper, we propose a novel parallel temporal-spectral
attention mechanism for CNN to learn discriminative sound representations,
which enhances the temporal and spectral features by capturing the importance
of different time frames and frequency bands. Parallel branches are constructed
to allow temporal attention and spectral attention to be applied separately,
in order to mitigate interference from the segments without the presence of
sound events. The experiments on three environmental sound classification (ESC)
datasets and two acoustic scene classification (ASC) datasets show that our
method improves the classification performance and also exhibits robustness to
noise.
Comment: submitted to INTERSPEECH202
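The parallel-branch idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mean-pooled scoring, the softmax weighting, and the fusion by addition are assumptions chosen for brevity, and the function name `temporal_spectral_attention` is hypothetical. The point is only to show one branch weighting time frames and another weighting frequency bands, each applied to the same input feature map.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_spectral_attention(feature_map):
    """feature_map[f][t]: value at frequency band f and time frame t.

    Assumption: scores are plain means and branches are fused by addition;
    a trained model would learn these scores instead.
    """
    n_freq = len(feature_map)
    n_time = len(feature_map[0])

    # Temporal branch: one weight per time frame, from the mean over frequency.
    time_scores = [sum(feature_map[f][t] for f in range(n_freq)) / n_freq
                   for t in range(n_time)]
    time_w = softmax(time_scores)

    # Spectral branch: one weight per frequency band, from the mean over time.
    freq_scores = [sum(row) / n_time for row in feature_map]
    freq_w = softmax(freq_scores)

    # Each branch reweights the same input; the results are then summed.
    fused = [[feature_map[f][t] * time_w[t] + feature_map[f][t] * freq_w[f]
              for t in range(n_time)]
             for f in range(n_freq)]
    return fused, time_w, freq_w

# Toy input: 2 frequency bands, 3 time frames, energy only in the middle frame.
fused, time_w, freq_w = temporal_spectral_attention([[0.0, 2.0, 0.0],
                                                     [0.0, 4.0, 0.0]])
```

In this toy input the middle frame and the second band receive the largest weights, while silent frames contribute nothing to the fused output.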
Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model
Integrating large language models (LLMs) into healthcare presents potential
but faces challenges. Directly pre-training LLMs for domains like medicine is
resource-heavy and sometimes unfeasible. Sole reliance on Supervised
Fine-tuning (SFT) can result in overconfident predictions and may not tap into
domain-specific insights. Addressing these challenges, we present a multi-stage
training method combining Domain-specific Continued Pre-training (DCPT), SFT,
and Direct Preference Optimization (DPO). A notable contribution of our study
is the introduction of a 3 GB Chinese Medicine (ChiMed) dataset, encompassing
medical question answering, plain texts, knowledge graphs, and dialogues,
segmented into three training stages. The medical LLM trained with our
pipeline, Qilin-Med, exhibits significant performance boosts. In the CPT and
SFT phases, it achieves 38.4% and 40.0% accuracy on CMExam, surpassing
Baichuan-7B's 33.5%. In the DPO phase, on the Huatuo-26M test set, it scores
16.66 in BLEU-1 and 27.44 in ROUGE-1, outperforming the SFT model's 12.69 and 24.21.
This highlights the strength of our training approach in refining LLMs for
medical applications.
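For readers unfamiliar with the final stage of this pipeline, the following is a minimal sketch of the commonly used DPO objective for a single preference pair. It is an illustration under stated assumptions, not Qilin-Med's training code: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are hypothetical, and the abstract does not state the hyperparameters used.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))).

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model. beta is an assumed value.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); computed in a numerically stable form.
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# The loss shrinks as the policy prefers the chosen answer more than the reference does.
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy and reference agree, the loss equals log 2; it falls below that once the policy raises the chosen response relative to the rejected one.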
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset
Recent advancements in large language models (LLMs) have transformed the
field of question answering (QA). However, evaluating LLMs in the medical field
is challenging due to the lack of standardized and comprehensive datasets. To
address this gap, we introduce CMExam, sourced from the Chinese National
Medical Licensing Examination. CMExam consists of 60K+ multiple-choice
questions for standardized and objective evaluations, as well as solution
explanations for model reasoning evaluation in an open-ended manner. For
in-depth analyses of LLMs, we invited medical professionals to label five
additional question-wise annotations, including disease groups, clinical
departments, medical disciplines, areas of competency, and question difficulty
levels. Alongside the dataset, we further conducted thorough experiments with
representative LLMs and QA algorithms on CMExam. The results show that GPT-4
had the best accuracy of 61.6% and a weighted F1 score of 0.617. These results
highlight a great disparity when compared to human accuracy, which stood at
71.6%. For explanation tasks, while LLMs could generate relevant reasoning and
demonstrate improved performance after finetuning, they fall short of a desired
standard, indicating ample room for improvement. To the best of our knowledge,
CMExam is the first Chinese medical exam dataset to provide comprehensive
medical annotations. The experiments and findings of LLM evaluation also
provide valuable insights into the challenges and potential solutions in
developing Chinese medical QA systems and LLM evaluation pipelines. The dataset
and relevant code are available at https://github.com/williamliujl/CMExam
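The two headline metrics for the multiple-choice task can be sketched as follows. The abstract does not say how the weighted F1 score is computed; this sketch assumes the common definition, in which per-class F1 scores are averaged with weights proportional to each class's number of true instances. The function names and the toy answer letters are illustrative only.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of questions where the predicted option matches the answer key."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 (assumed definition)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total
    return score

# Toy answer key and model predictions for four questions.
answers = ["A", "A", "B", "C"]
predictions = ["A", "B", "B", "C"]
acc = accuracy(answers, predictions)
wf1 = weighted_f1(answers, predictions)
```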
Calibrate and Refine! A Novel and Agile Framework for ASR-error Robust Intent Detection
The past ten years have witnessed the rapid development of text-based intent
detection, whose benchmark performances have already been taken to a remarkable
level by deep learning techniques. However, automatic speech recognition (ASR)
errors are inevitable in real-world applications due to environmental noise,
unique speech patterns, etc., leading to a sharp performance drop in
state-of-the-art text-based intent detection models. Essentially, this
phenomenon is caused by the semantic drift brought by ASR errors and most
existing works tend to focus on designing new model structures to reduce its
impact, which comes at the expense of versatility and flexibility. Unlike
previous one-piece models, in this paper we propose a novel and agile framework
called CR-ID for ASR error robust intent detection with two plug-and-play
modules, namely semantic drift calibration module (SDCM) and phonemic
refinement module (PRM), which are both model-agnostic and thus could be easily
integrated to any existing intent detection models without modifying their
structures. Experimental results on the SNIPS dataset show that our proposed CR-ID
framework achieves competitive performance and outperforms all the baseline
methods on ASR outputs, which verifies that CR-ID can effectively alleviate the
semantic drift caused by ASR errors.
Comment: Submit to INTERSPEECH 202
Modeling Label Dependencies for Audio Tagging with Graph Convolutional Network
As a multi-label classification task, audio tagging aims to predict the presence or absence of certain sound events in an audio recording. Existing works in audio tagging do not explicitly consider the probabilities of co-occurrence between sound events, which are termed label dependencies in this study. To address this issue, we propose to model the label dependencies via a graph-based method, where each node of the graph represents a label. An adjacency matrix is constructed by mining the statistical relations between labels to represent the graph structure information, and a graph convolutional network (GCN) is employed to learn node representations by propagating information between neighboring nodes based on the adjacency matrix, which implicitly models the label dependencies. The generated node representations are then applied to the acoustic representations for classification. Experiments on AudioSet show that our method achieves a state-of-the-art mean average precision (mAP) of 0.434.
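The graph construction and propagation step described above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the conditional-probability adjacency, the added self-loops, the row normalisation, and the single ReLU layer are assumptions, and all function names are hypothetical.

```python
def cooccurrence_adjacency(label_sets, n_labels):
    """Row-normalised adjacency where entry (i, j) starts as P(label j | label i).

    label_sets: one set of integer label ids per training clip.
    Self-loops are added so each node keeps its own features (assumption).
    """
    counts = [0] * n_labels
    co = [[0] * n_labels for _ in range(n_labels)]
    for labels in label_sets:
        for i in labels:
            counts[i] += 1
            for j in labels:
                if i != j:
                    co[i][j] += 1
    adj = [[co[i][j] / counts[i] if counts[i] else 0.0 for j in range(n_labels)]
           for i in range(n_labels)]
    for i in range(n_labels):
        adj[i][i] = 1.0
    return [[v / sum(row) for v in row] for row in adj]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, node_feats, weight):
    """One propagation step: ReLU(adj @ node_feats @ weight)."""
    out = matmul(matmul(adj, node_feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy data: labels 0 and 1 always co-occur, label 2 appears alone.
adj = cooccurrence_adjacency([{0, 1}, {0, 1}, {2}], n_labels=3)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
node_repr = gcn_layer(adj, identity, identity)
```

In this toy case the representations of labels 0 and 1 mix with each other after one layer, while label 2 is unchanged. In a full model, the learned node representations would then be combined with the clip's acoustic embedding to produce per-label scores.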